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Abstract 

Given n points in a d dimensional Euclidean space, the Minimum 
Enclosing Ball (MEB) problem is to find the ball with the smallest 
radius which contains all n points. We give a 0(ndQ/^/e) approxima- 
tion algorithm for producing an enclosing ball whose radius is at most 
e away from the optimum (where Q is an upper bound on the norm of 
the points). This improves existing results using coresets, which yield 
a 0(nd/e) greedy algorithm. Finding the Minimum Enclosing Convex 
Polytope (MECP) is a related problem wherein a convex polytope of 
a fixed shape is given and the aim is to find the smallest magnifica- 
tion of the polytope which encloses the given points. For this problem 
we present a 0(mndQ/e) approximation algorithm, where m is the 
number of faces of the polytope. Our algorithms borrow heavily from 
convex duality and recently developed techniques in non-smooth op- 
timization, and are in contrast with existing methods which rely on 
geometric arguments. In particular, we specialize the excessive gap 
framework of Nesterov [2005b] to obtain our results. 
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1 Introduction 

Given a set S = {xi, X2, . . . , x n } of n points in the minimum enclosing 
ball (MEB) is the ball with the smallest radius which contains all the points 
in S. The problem of finding a MEB arises in application areas as diverse 
as data mining, learning, statistics, computer graphics, and computational 
geometry [Elzinga and Hearn, 1972]. Therefore efficient algorithms for this 
problem are not only of theoretical interest, but also have wide practical 
applicability. 

Exact algorithms for finding the MEB typically have an exponential 
dependence on d [Megiddo, 1984, Welzl, 1991]. For example, the Welzl [1991] 
algorithm runs in 0(n(d + l)(d+ 1)!) time which makes it inadmissible for 
many practical applications; in the case of linear SVMs data may have a 
million or more dimensions. Therefore, there has been a significant interest 
in finding approximation algorithms for this problem. 

State of the art approximation algorithms for the MEB problem exten- 
sively use the concept of coresets [Clarkson, 2008, Badoiu and Clarkson, 
2002, Panigrahy, 2004, Yildirim, 2008]. Given an e > 0, an e-coreset S' C S 
has the property that if the smallest enclosing ball containing S' is expanded 
by a factor of (1+e), then the resulting ball also contains S. Therefore, locat- 
ing an e-coreset is equivalent to finding an (1 + e) approximation algorithm 
to the MEB problem. The approximation guarantees of such algorithms are 
multiplicative. Briefly, a coreset is built incrementally in a greedy fash- 
ion [Clarkson, 2008]. At every iteration, the MEB of the current candidate 
coreset is built. If every point in S lies in an (1 + e) ball of the current 
solution then the algorithm stops, otherwise the most violated point, that 
is, the point which is furthest away from the current MEB is included in the 
candidate coreset and the iterations continue. The best known algorithms 
in this family have a running time of 0{nd/e) [Panigrahy, 2004, Clarkson, 
2008]. 

In contrast, we present a new algorithm which is derived by casting 
the problem of finding the MEB as a convex but non-smooth optimization 
problem. By specializing a general framework of Nesterov [2005b], our algo- 
rithm is able to achieve a running time of 0(ndQ/y/e) where Q is an upper 
bound on the norm of the points. Also, the approximation guarantees of 
our algorithm are additive, that is, given a tolerance e > and denoting 
the optimal radius by R*, our algorithm produces a function whose value 
lies between R* 2 and R* 2 + e. Although these two types of approximation 
guarantees seem different, by a simple argument in section 3.2, we show 
that our algorithm also yields a traditional scale-invariant e multiplicative 
approximation with 0(ndQ/^/e) effort. 



We extend our analysis to the closely related minimum enclosing convex 
polytope (MECP) problem, and present a new algorithm. As before, given 
a set S = {xi,X2, . . . , x ra } of n points in M. d , the task here is to find the 
smallest polytope of a given fixed shape which encloses these points. In 
our setting translations and magnifications are allowed but rotations are 
not allowed. We present a 0(mndQ/e) approximation algorithm, where m 
denotes the number of faces of the polytope. 

We apply our algorithms to two problems of interest in machine learning 
namely finding the maximum margin hyperplane and computing the dis- 
tance of a polytope from the origin. A coreset algorithm for the first problem 
was proposed by Har-Peled et al. [2007] while the second one was studied 
by Gartner and Jaggi [2009]. In both cases our algorithms require fewer 
number of iterations and yield better computational complexity bounds. 

Our paper is structured as follows: In Section 2 we introduce notation, 
briefly review some results from convex duality, and present the general 
framework of Nesterov [2005b]. In Section 3 we address the MEB problem 
and in Section 4 the MECP problem, and present our algorithms and their 
analysis. We discuss some applications of our results to machine learning 
problems in Section 5. The paper then concludes with a discussion and 
outlook for the future in Section 6. Technical proofs can be found in Ap- 
pendix A and B, while preliminary experimental evaluation can be found in 
Appendix D. 



2 Definitions and Preliminaries 

In this paper, lower bold case letters (e.g., w, /x) denote vectors, while upper 
bold case letters (e.g., A) denote matrices or linear operators. We use Wi to 
denote the i-th component of w, A{j to denote the (i,j)-th entry of A, and 
(w, w') := Ylii w i w 'i t° denote the Euclidean dot product between vectors w 
and w'. A/% denotes the k dimensional simplex. Unless specified otherwise, 

||-|| refers to the Euclidean norm ||w|| := (w, w) = ^Y^ = \wf)^ ■ For a 
matrix A £ M raxd , we have the following definition of the norm 

|| A || = max {(Aw, u):||w|| = l,||u|| = l}. 
We also denote 1:=1U {oo}, and [t] := {1, . . . , t}. 

Definition 1 Let Qi C R n , / : Q 1 — > W, and f* : = min we Q 1 /(w) < oo. A 
point w' G Qi such that 

/(w')</* + e (1) 



is said to be an e-accurate minimizer of f. We will also sometimes call w' 
an e-accurate solution. 

The following three standard concepts from convex analysis (see e.g. Hiriart- 
Urruty and Lemarechal [1993]) are extensively used in the sequel. 

Definition 2 A convex function f : W 1 — > K is strongly convex with respect 
to a norm || • || if there exists a constant p > such that f — % || • || 2 is convex, 
p is called the modulus of strong convexity of f , and for brevity we will call 
f p-strongly convex or p-s.c. 

Definition 3 Suppose a function f : W 1 — > K is differ entiable on Q C M. n . 
Then f is said to have Lipschitz continuous gradient (Leg) with respect to 
a norm \\ ■ \\ if there exists a constant L such that 

||V/(w)- V/(w')|| <-L||w-w'|| Vw,w'gQ. (2) 

For brevity, we will call f L-l.c.g. 

Definition 4 The Fenchel dual of a function f : W 1 — >• M is a function 
f* : W l 1 defined by 

/*(w*)= sup {( W ,w*)-/(w)} (3) 

w£l™ 

Strong convexity and Lipschitz continuity of the gradient are related by 
Fenchel duality according to the following lemma: 

Lemma 5 (Hiriart-Urruty and Lemarechal [1993, Theorem 4.2.1 and 4.2.2]) 

1. If f : W 1 — > R is p-s.c, then f* is finite on R n and f* is - p -l.c.g. 

2. If f : W 1 — > M is convex, differentiable on M n , and L-l.c.g, then f* is 
l-s.c. 

2.1 Nesterov's Framework 

In sections 3 and 4 we will show that the MEB and MECP problems respec- 
tively can be cast as minimizing convex non-smooth objective functions. In 
a series of papers, Nesterov [Nesterov, 1983, 2005a,b] proposed a general 
framework for this task, which we now briefly review. 



Let Q\ and Q2 be subsets of Euclidean spaces and A be a linear map 
from Qi to Q2. Suppose / and g are convex functions denned on Q\ and Q2 
respectively, and we are interested in the following optimization problem: 

min J(w) where J(w) := /(w) + j*(Aw) = /(w) + max {(Aw, u) — g(u)} . 
weQi ueQ2 

(4) 

We will make the following standard assumptions: a) Q2 is compact; b) 
with respect to a certain norm on Q±, the function / defined on Q\ is p-s.c. 
but not necessarily I. eg, and c) with respect to a certain norm on Q2, the 
function g defined on Q2 is Lg-l.c.g and convex, but not necessarily strongly 
convex. 

The key difficulty in solving (4) arises because g* and hence J may be 
non-smooth. Our aim is to uniformly approximate J(w) with a smooth and 
strongly convex function. Towards this end let d be a cr-s.c. smooth function 
with the following properties: 

min d(u) =0, uo = argminci(u), and D : = max d(u). 
ue<22 ue Q 2 ueQ2 

In optimization parlance d is called a prox-function. For a positive constant 
define 

</ M (w) := /(w) + max {(Aw, u) -5(11) - /xrf(u)} . (5) 
ue<22 

It can be easily verified that J M is not only smooth and convex but also 
^-\\ A\\ 2 -l.c.g [Nesterov, 2005a]. Furthermore, if T> < 00 then is uniformly 
close to J, that is, 

J M (w) < J(w) < J M (w) + fiV . (6) 

If some mild constraint qualifications hold [e.g. Theorem 3.3.5 Borwein and 
Lewis, 2000] one can write the dual D(u) of J(w) using A T (the transpose 
of A) as 

D(u) := -g(u) - /*(-A T u) = -g(u) - max {(-Aw, u) - /(w)} , (7) 

weQi 

and assert the following: 

inf J(w) = sup .D(u), and J(w) > D(u) V w G Qi,u e Q 2 . (8) 
w eQi ueQ 2 



The key idea of excessive gap minimization pioneered by Nesterov [2005b] 
is to maintain two estimation sequences {w^} and {u^}, together with a 
diminishing sequence {/ifc} such that 



^M fc (wfe) < D(u k ), and lim fi k = 0. 

k— >OD 



(9) 



The idea is illustrated in Figure 1. In conjunction with (8) and (6), it is 
not hard to see that {w^} and {u&} approach the solution of min w J(w) = 
max u D(u). Using (6), (5), and (9), we can derive the following bound on 
the duality gap: 

J(w fe ) - D(u fc ) < J M (w fe ) + n h V - D(u k ) < fikV. (10) 

In other words, the duality gap is reduced at the same rate at which 
approaches 0. To turn this idea into an implementable algorithm we need 
to answer the following two questions: 

1. How to efficiently find initial points wi, ui and /xi that satisfy (9). 

2. Given Wfe, u^, and n^, how to efficiently find iterates Wfe+i, Ufc+i, and 
Hk+i which maintain (9). To achieve the best possible convergence 
rate it is desirable to anneal as fast as possible while still allowing 
w/; and Ufc to be updated efficiently. 

We will now show how the MEB and MECP problems can be cast as convex 
optimization problems and derive implementable algorithms by answering 
the above questions. 



3 Minimum Enclosing Ball 

Given a set of n points S = {xi, . . . ,x n } in a d dimensional space IR rf , a 
Euclidean ball B(c,R) of radius R centered at c is said to be an enclosing 
ball if Mi G [n], X; G B(c, R). 

3.1 Formulation as an Optimization Problem 

Clearly, x, G B(c,R) if, and only if, ||c — Xj|| 2 < R 2 . Using this observation, 
the MEB problem can be cast as the following optimization problem: 

mini? s.t. lie — Xjll 2 < R 2 Vi, 




Figure 1: Illustration of excessive gap. By strong convexity, the dual D(u) 
is always a lower bound to the primal J(w) (left). We approximate J(w) by 
a smooth lower bound J w (w) (middle). Since D(uk) is sandwiched between 
J/j, k (w) and J(w), as — >■ we get closer and closer to the true optimum 
(right). 



which in turn can be reformulated as 



mm max c 



X; 



Rearranging terms 



min J(c) 



+ max {-2 (c,Xj) + llxjll 2 ) 



(11) 



c 1 1 2 + max {(Ac, u) + (u, b)} , 



ueA 



(12) 



where A = — 2[xj,X2 . . . x n ] T , and b{ = ||xj|| 2 . Clearly, J(c) can be identified 
with (4) by setting Q x = R d , Q 2 = A n , /(c) = ||c|| 2 , and g(u) = - (u,b). 
It can be verified that g is 0-l.c.g, while / is 2-s.c. Therefore, one can 
employ Nesterov's framework (Section 2.1) to minimize J(c). However, as 
we stated before, we need to specialize the framework to our setting to obtain 
an efficient and implementable algorithm. Towards this end note that the 
Fenchel dual of ||-|| 2 is \ ||-|| 2 , and use (7) to write the dual of (12) as 

£>(u) = (u,b> ' 



u AA u. 

By using the Cauchy-Schwartz inequality, the gradient 

1 



VD(u) = b 



-AA'u, 



(13) 
(14) 



can be shown to satisfy 
||VD(ui)- VD(u 2 )|| = 



^AA T ui + ^AA T u 2 



< 



AA 



ui - u 2 



(15) 



thus establishing that D(u) is \ || AA T ||-Z.e.<?. We define L = \ ||AA T ||. 

Next we turn our attention to the prox-function. Recall that Nesterov's 
framework requires a a-s.c. prox-function on Q2 = A n ; in our case we set 

<T „ ,,9 



d(u) 



u - Uo 



where uo = (^,...,-) G M n . For this choice Uo = argmin ugAn <i(u), 
a! (uq) = 0, and 



T> = max d(u) = — max ||u — uo|| 2 = — 
ueA n 2 ueA„ 2 



1 



1 



< 



a 



nj 2 

Furthermore, for notational convenience, define the following three maps: 
£LLixii \ \ac, u; -f ||C|' 2 

c 



(16) 



c(u) = argmin \ (Ac, u) + ||c|| 2 > (17) 
ceR d ' 

u M (c) = argmax {(Ac, u) + (u, b) — fid(u)} = argmin j ^— ||u — uo|| 2 — (Ac + b, u)| . 



(18) 



V(u) = argmin <j — ||v — u|| 2 — (VD(u),v — u) 



argmin 
veA„ 



L, 



v — u 



b AA u, v — u 

2 



(19) 



With this notation in place we now describe our excessive gap minimization 
method in Algorithm 1. Unrolling the recursive update for jik yields 



Mfc = (1 — Tfc-l) Hk-1 



k 



k + 2 



Mfe-l 



(k)(k-l) 



,2 L 



6 



L 



(k + 2)(k + l)..Aa (k + 1) (Jfe + 2) a ' 

(20) 



Plugging this into (10) and using (16) immediately yields the following the- 
orem: 

Theorem 6 (Duality gap) The sequences {c^} and {u^,} in Algorithm 1 
satisfy 

6LT> 3L 



J(c fc ) - D(u k ) < 



< 



a(k + l)(k + 2) ~ (fc + l)(fe + 2)' 



(21) 



As is standard [see e.g. Yildirim, 2008], if we assume that the input data 
points lie inside a ball of radius Q, that is, maxj ||xj|| < Q then we can write 



L 



AA 



- max c T AA T u = - max 

2 || c ||=||uj|=i 2 || u ||=i 



Au 



2 max ||xj|| 2 = 2Q 2 . 



(22) 



Algorithm 1: Excessive gap minimization applied to MEB 
Output: Sequences {c^}, {u^}, and {/J-k} that satisfy (9), with 
limfc^oo (i k = 0. 

1 Initialize: Let uq = (i, ...,£), /ii = £, ci = c ( u o), ui = ^(uo).; 

2 for k = 1, 2, . . . do 



Tk k+3- 

(3 k 4- (1 - r fe )u fc + T fe u Mfc (cfc) 

C fe+ i <- (1 - T k )c k + T k c((3 k ) 

u k+1 <- V((3 k ). 



The last equality follows because A = — 2[xi,X2 . . . x n ] T and the maximum 
is attained by setting u = ej where j = argmax^ ||xj|| . Note that this is just 
a conservative estimate and a larger value of L also guarantees convergence 
of the algorithm. Plugging this back into (21) yields 

J(c t )-P(u t )< (t + ^ + 2) . (23) 

Therefore, to obtain an e accurate solution of (12) it suffices to ensure that 

6Q 2 



(fc + !)(& + 2) 



< e. (24) 



Solving for k yields the 0{Q/y/e) bounds on the number of iterations as 
claimed. All that remains is to show that 

Theorem 7 The update rule of Algorithm 1 guarantees that (9) is satisfied 
for all k > 1. 

Proof See Appendix A. ■ 



Each iteration of Algorithm 1 requires us to compute c(u), u^(c), and 
V(u) (see (17), (18), and (19)). All other operations either require constant 
or linear time. We now show that each of these three maps can be com- 
puted in 0{nd) time. This in conjunction with Theorem 6 shows that the 
time complexity of our algorithm to find an e accurate solution of (12) is 
Oind/^e). 



By computing the gradient of ( Ac, u) + ||c|| 2 and setting it to zero we 
can show that 

c(u) = "^A T u. (25) 

Since A is a n x d matrix computing c(u) takes 0{nd) time. On the other 
hand, computation of u„(c) can be cast as the following Quadratic program- 
ming (QP) problem with linear constraints: 

min — — llull 2 — (Ac + b + /xcruo, u) (26) 
u 2 

s.t. u i = 1 an d < Ui < 1. 

i 

Computing Ac + b + fiau^ requires 0(nd) time. Given Ac + b + /xcuo, we 
show in Appendix B that the above QP can be solved in 0{n) time. Finally, 
after some simple algebraic manipulation computation of V(u) can be also 
be cast as a Quadratic programming (QP) problem with linear constraints 
as follows: 

^AA T u + b, (27) 
1 and < Vi < 1. 

Again, the computational bottleneck is in computing Lu- |AA T u+b which 
takes 0{nd) effort a . Given Lu — ^AA T u + b the algorithm in Appendix B 
can be applied to solve (27) in 0{n) time. 

3.2 Multiplicative versus Additive Approximation: Scale In- 
variance 

Existing approximation algorithms for the MEB problem based on coresets 
provide a multiplicative approximation. Given a set of points, the coreset 
algorithms output a center c and radius R such that all the given points lie 
inside the ball of radius R(l + e) centered at c. In contrast, our algorithm 
produces a center and a radius R such that R* 2 < R 2 < R* 2 + e', where R* 
denotes the radius of the optimal minimum enclosing ball. 

At first glance the two types of guarantees do not look directly compara- 
ble since the additive guarantees seem to vary with change of scale. To pro- 
duce an em scale invariant multiplicative approximation with our algorithm, 




i 



a To compute AA u efficiently we first compute a = A u and then compute Aa. 



set e' = e M R* 2 - In view of (24) it follows that R* 2 < R 2 < R* 2 (l + e M ) 
whenever 13 

6Q 2 9 

< R* 2 e M - (28) 



(k + l)(k + 2) 
Solving for k obtains 

Q I 6 



R V £M 

However, since i£* is unknown this bound cannot be used as a practical 
stopping criterion. Instead, one can use the following observation: select a 
arbitrary pair of points Xj and Xj from S and compute | ||xj — Xj||. Denote 
this quantity by V, and let c* be the center of the optimal MEB. Clearly 

V = - llxj - xJI < - llxj - c* II + - ||x,- — c* || = -R* + -R* = R*. (30) 
2 3 " ~ 2 11 2 J 2 2 

Therefore replacing i?* by V in (29) yields the following computable upper 
bound on the number of iterations: 




(31) 

This shows that 0(1/ ^/cm) iterations of our algorithm suffice to produce a 
eM- m ultiplicative approximation. 

Note that just like in the case of coreset based algorithms this bound is 
scaling invariant, that is, if all the Xj in S are scaled by a factor a > the 
bound still holds. To see this one merely has to observe that after scaling 
V becomes aV and Q becomes aQ, but the ratio Q/V which appears in 
(31) remains unchanged. Thus while our algorithm is modeled as a generic 
optimization formulation, it is possible to obtain scale invariance by appro- 
priately choosing e' to be euV 2 . 



4 Minimum Enclosing Convex Polytope 

In the Minimum Enclosing Convex Polytope (MECP) problem we are given a 
polytope of fixed shape which can be translated and magnified but rotations 
are not allowed. Furthermore, we assume that the polytope has a finite 



b We prove our bounds in terms of R 2 but it is trivial to convert this to a bound in 
terms of R. 



number of faces and hence it can be expressed as an intersection of a finite 
number of hyperplanes: 



(wj,x-c) <ii i = 1,2, ....to, 

where c is the center of the polytope about which it is magnified. Given a set 
of points S = {xi, X2, . . . , x„} we want to find the minimum magnification 
of the convex polytope that encloses all the points in S. 



4.1 Formulation as an Optimization problem 

Clearly an enclosing polytope is one for which (wj, Xj — c) < t% for all i and 
j. This observation helps us to cast the MECP problem as the following 
optimization problem 

min R s.t. (wj,x,- — c) < Rb; 
ReM,ceQi 

or equivalently as 

min R s.t. (—,Xj — c)<R Vi, j. 
Rm,ceQi \ U J j 

Here R is the scale of magnification of the polygon and Q\ C M rf is a bounded 
set, for example, a ball of certain fixed radius Q centered at the origin which 
is assumed to contain the solution c. Usually Q\ is taken to be a ball which 
contains all the points in S but this need not always be the case. Also we 
will assume that Wj/£j lie inside a ball of radius W. Writing Wj = Wj/ij the 
problem can be expressed as, 

min max(wj,Xj — c) = min max > tijj((wj,Xj — c)). (32) 

cGQi i,j ceQi u6A mn z — ' 

Notice that, as before, we have replaced the maximization over a finite set 
by a maximization over the simplex. Clearly this problem can be rewritten 
as 

min J(c) where J(c) = max N Uy((wj, Xj — c)) = max {(Ac, u) + (u, b)} . 

(33) 

We used A = [ — wi , . . . — W2, — w m ] T , a mti x d matrix with each 

n times n times 

Wj repeated n times as the columns and b a mn dimensional vector with 



bij = (wj,Xj) to write the above expression. Clearly J(c) can be identified 
with (4) by setting Q2 = A mn , /(c) = and g(u) = — (b,u). Here, g is 
0-l.c.g, however, / is no longer strongly convex. Therefore, we will work 
with the following function 

Jrj(c) = 77 ||c|| 2 + max {(Ac, u) — <?(u)} . 

ueA mn 

Let J* = mincgQj J v (c) and J* = min cg Q 1 J(c). Since ||c|| 2 < Q 2 for all 
c G Qi we have that 

J* < J* + r]Q 2 . 

Suppose we can minimize to e/2 precision, that is, find a c such that 
Jrj( c ) < ^ + e /2 then the above observation allows us to write the following 
series of inequalities 

J(c) < j v (c)< j; + e -< J* + e - + v Q 2 . 

In other words, every e/2 accurate solution of J v is a e/2+r/Q 2 accurate solu- 
tion of J. In particular, if we set r\ = e/2Q 2 then every e/2 accurate solution 
of J v is an e accurate solution of J. Furthermore, J v is nearly identical to 
(12) except that ||c|| 2 is now replaced by ^2 || c l| 2 - Consequently, Algorithm 
1 can be directly applied to minimize with the following changes: 



D v (u) 
VD„(u) 
V(u) 



(u,b)- 

, Q 2 



argmm 
veA m „ 



Q 2 
2e 



u T AA 1 u. 



T, 



AA'u. 



argmm 
veA mn 



v - u|| 2 - (VD v (u),v- u) 



u 



Q 



AA u, v 



u 



2 " \ e 

A simple application of the Cauchy-Schwartz inequality shows that 



|VA,(ui) - VD„(u 2 ) 



Q 2 t Q 2 T 
— AA ui + — AA U2 

e e 



< 



Q 



AA 



(34) 
(35) 

(36) 



||ui - 
(37) 



thus establishing that D v (u) is ^ || AA T ||-/.c.5. We define = ^ ||AA T ||. 
Since Wj are assumed to lie inside a ball of radius W, by an argument anal- 
ogous to (22), it follows that L v = Q 2 W 2 /e. Plugging this into (21) obtains 

3Q 2 W 2 



J(c fe ) - D(u k ) < 



e(fc + l)(fe + 2)' 



(38) 



In order to obtain an e/2 accurate solution we need to solve for k by setting 

€{ k+i){k+2) ^ e / 2 - This y ields k ^ V6QW/e, which shows the 0(Q/e) 
iteration bound as claimed. Extending the arguments for MEB to the MECP 
case, the per iteration complexity of 0{mnd) can readily be established. We 
omit details for brevity. 



5 Applications to Machine Learning 

The connection between the MEB problem and SVMs has been discussed 
in a number of publications [Clarkson, 2008, Har-Peled et al., 2007, Gartner 
and Jaggi, 2009]. Practical algorithms using coresets were also proposed 
in Tsang et al. [2007] and Tsang et al. [2005]. In all these cases, our im- 
proved MEB algorithm can be plugged in as a subroutine and will yield 
corresponding speedups. We describe these kernel based algorithms and 
their connection with MEB in appendix C. In this section, we present two 
machine learning problems wherein our algorithms lead to better bounds 
than existing coreset based approaches. 



5.1 Finding Large Margin Classifiers 

In Har-Peled et al. [2007] a coreset algorithm (Coreset SVM) for finding 
the maximum margin hyperplane was described. It turns out that our 
MECP algorithm can be specialized to their setting, and yields improved 
bounds. Briefly, given m labeled data points (zj,yj) with Zj 6 W 1 and 
Hi £ {±1} the maximum margin hyperplane can be found by solving 
argmax cg]R d j || c || =1 minj (zj, c). Equivalently, by defining Wj = j/j • Zj one 
can solve 

argmin max (wj, -c) . (39) 
ceK d ,||c||=i » 

The above problem can be identified with (32) if we set S to be the empty 
set and Qi to be {c G R d s.t. ||c|| = l}. With this substitution our MECP 
algorithm can directly be applied to solve (39). In this case Q = 1 and the 
bound (38) reduces to 

3W 2 e 
e(fc + l)(ife + 2) " 2' ( j 



c Har-Peled et al. [2007] use for the training data and w for the hyperplane. Here 
we use Zj and c respectively to be consistent with our notation. 



Solving for k shows that \/6 iterations suffice to obtain an e accurate 
solution of (39). 

The Coreset SVM algorithm produces a multiplicative approximation. 
To compare with the bounds given by Har-Peled et al. [2007] we follow the 
same scheme described in Section 3.2. Let p* denote the optimal margin, 
that is, min cgK d || c |j =1 maxj (wj, — c), and set e = p*€m- Substituting into 
(40) and solving for k shows that to produce a €m multiplicative approxi- 
mation with our algorithm 

Vg(— J — <k. (41) 
In contrast, Har-Peled et al. [2007] compute a Coreset SVM C of size 

in |C| iterations. Furthermore, the computational complexity of Coreset 
SVM is 0{md\C\ + |C|T(|C|)), where T(\C\) is the cost of training a SVM 
on |C| points. Since each iteration requires only 0(md) effort, our algorithm 

w\ 1 



has an improved computational complexity of O ymd y^j T^jj- However, 
there is one notable difference between the two algorithms. Coreset SVM 
produces a sparse solution in terms of the number of support vectors, while 
our algorithm comes with no such guarantees. 



5.2 Computing Polytope Distance 

Given a set of points S = {xi,X2, . . . ,x n } the polytope distance is defined 
as the shortest distance p of any point inside the convex hull of S, conv(<S), 
to the origin [Gartner and Jaggi, 2009]. Equivalently, we are looking for 
the vector with the smallest norm in conv(S'). A variant is to compute the 
distance between two polytopes given by the points S+ = {zi, Z2, . . . , z n } 
and iSL = { z[ , z' 2 , . . . , z' n , } . This problem can be solved by finding the 
polytope distance of the Minkowski difference of conv(S+) and conv(5_) 
[Bennett and Bredensteiner, 1998]. Since the arguments for both cases are 
by and large very similar we will stick with the simpler formulation in this 
paper. The polytope distance problem has a number of applications in 
machine learning. A partial (and by no means exhaustive) list of relevant 
publications includes Gartner and Jaggi [2009], Bennett and Bredensteiner 
[1998], and Keerthi et al. [2000]. 



To analyze this problem in our setting we start with a technical lemma. 
A version of this lemma also appears as Theorem A. 2 in Bennett and Bre- 
densteiner [1998]. 

Lemma 8 The following two problems are duals of each other: 

min J(c) := max (c — Xj, c) (43) 

c i 

1 12 

max D(u) := --u T AA T u = -- A T u , (44) 
ueA„ 4 4 

where A = -[xi,x 2 , . . . ,x n ] T . 

Proof First rewrite the objective function in (43) as 

J(c) = ||c|| 2 + max (— Xj, c) = ||c|| 2 + max (u, Ac) . 

i ueA„ 

This can be identified with (4) by setting Q\ =R d , Q2 = A n , /(c) = ||c|| 2 , 
and g(u) = 0. The dual problem (44) can directly be read off from (7) by 

1 1 2 1 1 1 2 

noting that that the Fenchel dual of ||-|| is ^ IHI • ' 

Clearly (44) computes the vector with the smallest norm in conv(S'), which is 
equivalent to the polytope distance problem. Furthermore, (43) and (44) are 
identical to (12) and (13) respectively with b = 0. Therefore the algorithm 
we described in Section 3 can be applied with minor modifications to yield 
a 0{nd/ '\/e) algorithm for this problem also. Since the Gartner and Jaggi 
[2009] algorithm selects coresets, at most 0(1/ e) components of u are non- 
zero. However, in our case no such guarantees hold. 



6 Conclusions and Future Work 

We presented a new approximation algorithm for the MEB problem whose 
running time is 0(ndQ/^/e). Unlike existing algorithms, which rely heavily 
on geometric properties, our algorithm is motivated and derived purely from 
a convex optimization viewpoint. We extended our analysis to the MECP 
problem and obtain a 0(mndQ/e) algorithm. Not only does our treatment 
yield an algorithm with better bounds, but preliminary experimental results 
in Appendix D suggest that our algorithm is competitive on problems with 
a large number of data points. 

A more general version of the MECP problem was studied by Panigrahy 
[2004]. In his setting the convex polytope is allowed translations, magnifica- 
tions, and rotations. A simple greedy approach (very reminiscent of coresets) 



yields an (1+e) multiplicative approximation algorithm which takes 0(1/ e 2 ) 
iterations to converge. Please consult Panigrahy [2004] for details. 

A natural question to ask is the following: Can our algorithms be ex- 
tended to deal with arbitrary convex shapes? In other words, given an ar- 
bitrary convex shape can we find the optimal magnification and translation 
that is required to enclose the set of points {xj}™ =1 at hand. Unfortunately, 
a straightforward extension seems rather difficult. Even though it is well 
known that every convex shape can be described as an intersection of half 
planes, the number of half planes need not be finite. In such a case our 
MECP algorithm, which crucially relies on the number of half planes being 
finite, is clearly not applicable. Somewhat surprisingly, we are able to ob- 
tain a 0(1/ y/e) algorithm for the MEB problem, even though a ball is made 
up of an intersection of infinitely many half planes. Clearly, the strong 
convexity of the objective function plays an important role in this context. 
We are currently trying to characterize such problems in the hope that this 
investigation will lead to efficient algorithms for a number of other related 
problems. 

Rotations are a natural concept when working with geometric algo- 
rithms. This does not naturally carry over to our setting where we use 
convex optimization. In order to introduce rotations one has to work with 
orthogonal matrices, which significantly complicates the optimization strat- 
egy. A fruitful pursuit would be to investigate if the insights gained from 
coresets can be used to solve complicated optimization problems which in- 
volve orthogonal matrices efficiently. 

Even though we are only beginning to scratch the surface on exploring 
connections between optimization and computational geometry, we firmly 
believe that this cross pollination will lead to exciting new algorithms in 
both areas. 
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Appendix 

A Proof of theorem 7 

Our proof is by and large derived using results from Nesterov [2005b]. We 
begin with a technical lemma. 

Lemma 9 (Lemma 7.2 of Nesterov [2005b]) For any u and u, we have 

D(u) + (VD(u),u-u) = (Ac(u) + b,u) + ||c(u)|| 2 . 
Proof Direct calculation by using (13), (14), and (25) yields 

D(u) + (Vfl(u),u - u) = (u, b) - Ju T AA T u + ^b - ^AA T u, u - u 

= <u,b> - ^AA T u,u|) + |u T AA T u 
= (Ac(u) + b,u) + ||c(u)|| 2 . 



We first show that the initial wi and cti satisfy the excessive gap condition 
(9). Since — D is L-/.c.g(from (15)), so 

Z>(ui) > D(u ) + (VD(uq), Ul - u ) - - [|ui - u f 

r l 2 

(using defn. of ui and (19)) = max < D(uq) + (V-D(uo), u — Uo) — — ||u — uo II 

ueA„ [ 2 

(using lemma 9) = max < (u, b) + (Ac(uo), u) + ||c(uo) || 2 — — ||u — uo|| 2 

u6A„ I 2 

I ^— 1 1 u — u o 



(using defn. of d and = max \ (u, b) + (Ac(uo),u) + ||c(uq) 



(using ci = c(u )) = ||ci|| 2 + max {(Aci, u) + (u, b) - ||u - uo|| 2 

ueA n L 2 

> ^Mi( c l) 

which shows that our initialization indeed satisfies (9). Second, we prove by 
induction that the updates in Algorithm 1 maintain (9). We begin with two 
useful observations. Using (20) and the definition of r^, one can bound 

6 L L 

W+i = (1 - = {k + m + 2) ~ > rl~. (45) 



Let 7 := u flk (c k ). The optimality conditions for (18) imply (//fe0"(7 — uo) — Ac^ — b, u — 7) > 
and hence 

^fccr (7-110,11-7) > ( Ac fe + b, u - 7) . (46) 

1 1 2 

By using the update equation for c^+i and the convexity of ||-|| 

J^ k +A c k+i) = ||cfe+i|| 2 + max {(Ac fe+ i,u) + (u, b) - ^ k+l(T || u - u || 2 ) 

ueA„ k 1 ) 

= ||(1 - T k ) Ck + T kC (p k )\\ 2 

+ max {(1 - T fc ) (Ac fe ,u) + r fc (Ac(/3 fc ),u) + (u, b) - (1 - r k )^- ||u - u || 2 j 

ueA n I. A ) 



< max {(1 - r fe )Ti + T k T 2 } 
ueA n 



where 



T-2 

Ti can be bounded as follows 



u- u || + (Ac fc + b,u) + (Icel- 



and 



(Ac(/3 fc )+b,u) + ||c(/3 fe )|| 2 



Ti = ll u - u o|| 2 + (Ac fc + b, u) + Hcfcll 2 

= — ^-Il u -7ll — ^— IIt — u-oll - VkO (7 - u ,u - 7) 



+ (Ac fe + b,u) + ||c fe || 



2 



(using (46)) < || U - 7|| 2 "of - (Ac fe + b, u - 7) 



+ (A Cjfc + b,u) + ||c fc || 



2 



, } J" - 7II - IIt - "oil + (Ac fc + b, 7 > + ||c fe | 
(using defn. of 7) = --^ ||u - -y 1 1 2 + -W( w fc) 
(using induction assumption) < — ||u — 7|| 2 + D(u k ) 

(using concavity of D) < -^f ||u - 7 || 2 + D{(3 k ) + (VD{/3 k ), u k - (3 k ) , 

while T2 can be simplified by using Lemma 9: 

T 2 = (u, b) + <Ac(/3 fc ), u) + ||c(/3 fc )|| 2 = D((3 k ) + {VDQ3 k ),u - (3 k ) . 



Putting the upper bounds on T\ and T2 together, and using (45) we obtain 
the following result. 

•Wi( c *+i) < max {(1 - Tk ) ||u - 7 || 2 + D((3 k ) + {VD(f3 k ),u k - (3 k ) 

+ r k [D(J3 k ) + (VD(ft),u - (3 k )]} 

< D((3 k ) + max { -r fc 2 ^ ||u - 7 || 2 + (VD((3 k ), (1 - r k )u k + r fc u - f3 k ) 
ueA n { I 

(47) 

Let v = (1 — r k )u k + r k u. By using the definition of j3 k from Algorithm 1 
observe that 

(1 - T k )u k + TfcU - (3 k = r k (u - 7) = v - (3 k . (48) 

Furthermore, v € A n since it is a convex combination of £ A n and 
u € A n . Plugging (48) into (47) 

J„ k+1 (ck+i) < D(f3 k ) + max \~ ||v - (3 k \\ 2 + (VD(ft),v- /3 fe >) 

veA„ { z ) 

L 2 

(using (19) and defn. of u fc+ i) = D((3 k ) + (VD(/3 k ), u k+ i - (3 k ) - - ||u fe+1 - /3 fc || 
(Since — D is L — Z.c.5) < D(ufc + i). 

B A linear time algorithm for a box constrained 
diagonal QP with a single linear equality con- 
straint 

In this section, we focus on the following simple QP: 

1 71 

-Y^d^-mif (49) 
i=i 

s.t. U <ai < Ui Vi € [n]; 

n 



mm 

2 

i=i 



OiOli = z. 

-1 



Without loss of generality, we assume U < Ui and di 7^ for all i. Also 
assume Oi 7^ because otherwise ai can be solved independently. To make 
the feasible region nonempty, we also assume 

Oi(5{ai > 0)k + 5{ai < 0)ui) <z<Y ai(5(ai > 0)m + 5{ai < 0)k). 



The algorithm we describe below stems from Pardalos and Kovoor [1990] 
and finds the exact optimal solution in O(n) time. 

With a simple change of variable = (Tiion — mi), the problem is sim- 
plified as 



1 " 



min \ 

i=i 



s.t. I't <Pi < u i Vi G [n]; w ^ ere 



i=i 



cTj(Zi - mj) if Oi > 
crj(ui - mi) if < ' 

(7i(«i - mi) if o-j > 
o-j(/i - mi) if o-j < ' 

1 i 

We derive its dual via the standard Lagrangian. 

l = Jl^tf - XX(a - o +I>r(A - - A ( E& - A ■ 

i i i \ i / 

Taking derivative: 
dL 

— = a 2 i/ 3 i -pf + p7-\ = ft = dr 2 (p+-p7 + \). (50) 

Substituting into L, we get the dual optimization problem 
mmD(X, p+,p7) = d -\pi ~ P- + A) 2 - E pPi + E ^ " A/ 

i i i 

s.t. pt>0, p~>0 Vi€[n]. 
Taking derivative of -D with respect to A, we get: 

"£dr 2 (pt-p- + X)-z' = 0. (51) 



The KKT condition gives: 



pt(Pi-li) = 0, (52a) 
p~{fa-u l ) = Q. (52b) 

Now we enumerate four cases. 

1. pf > 0, p^ > 0. This implies that /• = fa = u^, which is contradictory 
to our assumption. 

2. pf = 0, pT = 0. Then by (50), fa = d~ 2 A G [Zj.uJ], hence A G 

3. p+ > 0, pT = 0. Now by (52) and (50), we have l\ = fa = dr 2 (pf + \) > 
d^ 2 X, hence A < dfl[ and pf = - A. 

4. p+ = 0, > 0. Now by (52) and (50), we have w- = fa = d~ 2 (-p^ + 
A) < d~ 2 X, hence A > dfu^ and pT" = -Sfu^ + A. 

In sum, we have pf = [dfl^ — A]+ and = [A — d?u£|+. Now (51) turns 
into 

/(A) := Y,d7\Wi ~ A]+ - [A - + A) -z' = 0. (53) 

i v ' 

=:fci(A) 

In other words, we only need to find the root of /(A) in (53). hi{\) is given 
by 

r if \<& i i' i 

fti(A) = { AJr^ if ^ < A < ^ (54) 
I uj if A>d>< 

Note that /ij(A) is a monotonically increasing function of A, so the whole 
/(A) is monotonically increasing in A. Since /(oo) > by z' < Yli u 'i an d 
/(— oo) < by z' > J2i l'n the root must exist. Considering that / has at 
most 2n kinks (nonsmooth points) and is linear between two adjacent kinks, 
the simplest idea is to sort {(if <if -u- : % G [re]} into < ... < s^ 2n \ If 
f(s^) and f(s( t+1 ^) have different signs, then the root must lie between them 
and can be easily found because / is linear in [sW,s^ +1 ']. This algorithm 
takes at least 0(n log re) time because of sorting. 



Algorithm 2: O(n) algorithm to find the root of /(A). Ignoring 
boundary condition checks. 

Input: Function / 

Output: A*: Root of / 

1 Initialize: Set kink set S <- {dfZ< : i G [n]} U {dfu- : £ G [n]}.; 

2 while |5| > 2 do 

3 Find median of S: m <- MED (5). 

4 if /(m) > then 
I S <- {i £ S : i < m}. 

else 

| S <- {i 6 5 : i > m}. 



8 Return root ^jfy-f^l) wnere & = {^ u }- 



However, this complexity can be reduced to 0(n) by making use of the 
fact that the median of n (unsorted) elements can be found in 0(n) time. 
Notice that due to the monotonicity of /, the median of a set S gives exactly 
the median of function values, i.e., /(MED(S)) = MED({/(x) : x G S}). 
Algorithm 2 sketches the idea of binary search. The while loop terminates 
in log 2 (2n) iterations because the set S is halved in each iteration. And in 
each iteration, the time complexity is linear to \S\, the size of current S. 
So the total complexity is 0(n). Note the evaluation of f(m) potentially 
involves summing up n terms as in (53). However by some clever aggregation 
of slope and offset, this can be reduced to 0(|5|). 

C Learning SVMs with MEB 

Kernel methods in general and support vector machines (SVMs) in partic- 
ular have received significant recent research interest in machine learning 
[Scholkopf and Smola, 2002]. Underlying a SVM is a simple geometric idea. 
Given a training set {(xj, yi)}" =1 of n points Xj labeled by m G {±1} the aim 
is to find the hyperplane which maximizes the margin of separation between 
points from different classes. This is compactly written as the following 



optimization problem (see Scholkopf and Smola [2002] for details): 



1 n 
|| w f + ( 55a ) 



mm 

wef 2 

s.t. 2/j((w, Xj) + 6) > 1 — £j for all i. (55b) 

Standard duality arguments (see e.g. Boyd and Vandenberghe [2004]) yield 
the following dual Quadratic Programming (QP) problem 

max (oL.e) ot J Ka (56a) 

n 

s.t. ajy,, = (56b) 

i=l 

< a, < C for all i. (56c) 

Here -fT is a m x m matrix whose entries are given by y%yj (xj,Xj) and 
e denotes the vector of all ones. Since the dual only depends on x via the 
inner products (xj , Xj ) one can employ the kernel trick: map £j into a feature 
space via 0(xj) and compute the dot product in the feature space by using 
a kernel function &(xj,Xj) := (</>(xj), <t>(x-j)) Scholkopf and Smola [2002]. 
The kernel trick makes SVM rather powerful because simple linear decision 
boundaries in feature space map into non-linear decision boundaries in the 
original space where the datapoints live. 

A number of different techniques have been proposed for solving the 
quadratic problem associated with SVMs. Of particular interest in our con- 
text is the Core Vector Machine (CVM) [Tsang et al., 2007, 2005]. The key 
idea of the CVM is the observation that solving (56) is equivalent to finding 
the MEB of the feature vectors. The MEB problem in feature space can be 
written as (see Tsang et al. [2007] for details) 

min R 2 (57a) 



||c-0( Xi )|| 2 < R 2 for all! (57b) 
The dual of the above minimization problem then becomes (also see (13)) 



^OLiK it i - a T Ka (58a) 
i=l 

s.t. cv;>0, J^Oi = l (58b) 



max 

a 

1=1 



where Kij = (<^(xj), (f>(x-j)) as before. In particular, if each equals a 
constant c then it can be shown that by a simple transformation that the 
standard SVM dual (56) and the CVM dual (58) can be identified. There- 
fore, every iteration of the CVM algorithm [Tsang et al., 2005] identifies an 
active set of points, and computes the MEB of this active set. This is done 
via the coreset algorithm of Panigrahy [2004], hence the name core vector 
machine. In fact, our algorithm for the MEB can directly be plugged into 
the CVM, and improves the rates of convergence of the inner iteration from 
0(nd/e) to 0(nd/ '\/e). We hope that our algorithm with improved conver- 
gence rates can also be used in similar machine learning algorithms to speed 
up their convergence. 

D Experimental Results 

The aim of our experiments is to demonstrate the efficacy of Algorithm 1 and 
compare its performance with existing MEB algorithms in terms of running 
time and number of iterations required for convergence. Following [Yildirim, 
2008] we generate data using a random multivariate Gaussian distribution 
and vary n the number of data points and d the dimensions. For each fixed 
n and d we generate 5 random datasets and report the average performance 
of our algorithm with the multiplicative guarantee e = 10~ 3 in Table 1. 
Recall from Section 3.2 that the multiplicative guarantee e is equivalent to 
the additive tolerance eP 2 , where V is chosen via 

-D 1 II II 

P = - max ||Xj — Xj|| . 

For reference we also reproduce the results reported in Table 1 of Yildirim 
[2008]. Although not a fair comparison, the CPU times gives an indication 
of the relative performance of various algorithms. In the table BC refers to 
the coreset algorithm of Badoiu and Clarkson [2002] while Al and A2 are 
MEB algorithms implemented in Yildirim [2008]. 

Our algorithm performs significantly better than BC and Al, while being 
comparable to A2. Furthermore, our algorithm usually takes smaller number 
of iterations and particularly shines when the number of points is large. 
Our implementation is preliminary, and we believe that our algorithm will 
benefit from practical speed ups in much the same way that algorithm A2 
was obtained by improvising Al to get rid of redundancies. 



n 






Time 






Iterations 




d 


Al 


A2 


BC 


Ours 


Al 


A2 


BC 


Ours 


500 


10 


0.06 


0.03 


0.12 


0.04 


168.7 


44.5 


435.5 


44.2 


1000 


10 


0.15 


0.03 


0.14 


0.10 


330.7 


41.6 


344.4 


54.5 


5000 


20 


1.7 


0.36 


3.11 


1.08 


246.8 


46 


464.2 


69.7 


10000 


20 


4.46 


0.58 


4.65 


3.40 


319.2 


36.3 


334.4 


105.2 


30000 


30 


27 


6.45 


24.59 


10.43 


446.4 


103.6 


409 


77.8 


50000 


50 


71.62 


16.87 


68.78 


18.69 


429.8 


98.4 


415.1 


54.5 


100000 


100 


287.99 


77.74 


268.11 


83.18 


451.7 


119 


422.6 


63 



Table 1: Computational Results with n » d (e = 10 3 ) 



